W3: Data Visualization

Penguins Dataset

gt::gt(head(penguins))
species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007
Adelie   Torgersen  39.3            20.6           190                3650         male    2007
  • Note that our dataset has column names
  • In ggplot2, we don’t need to use the $ operator: penguins$species
  • We use the bare column name to refer to it: species
    • bill_depth_mm : numeric
    • bill_length_mm : numeric
    • species : factor (categorical)
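The column types can be checked directly before plotting. This is a minimal sketch, assuming `penguins` comes from the palmerpenguins package (the slides do not show where the data is loaded from):

```r
# Assumption: `penguins` is the dataset shipped with the palmerpenguins package.
library(palmerpenguins)

# Look up the class of each column we plan to plot
col_types <- sapply(
  penguins[c("bill_length_mm", "bill_depth_mm", "species")],
  class
)
col_types
```

Knowing whether a column is numeric or categorical tells you which of the common plots below applies to it.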

{visdat} for Exploratory Data Analysis

library(visdat)
vis_dat(penguins)

Common Plots

One Variable

  • Numeric: histogram
  • Categorical (character or factor): bar plot

Two Variables

  • Numeric vs. Numeric: Scatterplot, line plot
  • Numeric vs. Categorical: Box plot

Why focus on these plots?

We build a plot one part at a time

Data +

Mapping to data +

Geometry

Think about making plots like using recipes from a cookbook: https://r-graphics.org/

One variable plots

Building a Histogram

ggplot(penguins) +            # Data +
  aes(x = bill_length_mm) +   # Mapping to data +
  geom_histogram()            # Geometry

ggplot(penguins)

ggplot(penguins) +

  • We always start with ggplot()
  • The first argument to ggplot() is the data
  • We add details to the plot with the + (plus sign)

aes():

aes(x = bill_length_mm) +

  • We map data in with the aes() function
  • x is an aesthetic - it maps data to a visual property
  • In the aes() function, we use bare column names: bill_length_mm
  • If you want to know what aesthetics to map, look at the geom documentation, for example ?geom_histogram

geom_histogram()

geom_histogram()

  • All geometries begin with geom_
  • geom_s require specific aesthetics
  • When in doubt, look at the documentation:
    • ?geom_histogram

Taking it one part at a time

ggplot(penguins)

Taking it one part at a time

ggplot(penguins) +
  aes(x = bill_length_mm)

Taking it one part at a time

ggplot(penguins) +
  aes(x = bill_length_mm) +
  geom_histogram()
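The same three steps can also be run by storing the plot in a variable and adding one part per line. This is a sketch, assuming ggplot2 is installed and `penguins` comes from palmerpenguins; the variable name `p` is only for illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of `penguins`

p <- ggplot(penguins)              # data only: an empty panel
p <- p + aes(x = bill_length_mm)   # mapping: the x axis now shows bill length
p <- p + geom_histogram()          # geometry: the bars are drawn
p                                  # print the finished plot
```

Printing `p` after each line shows what that part contributes. The final plot comes with a message about the default number of bins and a warning that rows with missing bill lengths were dropped.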

Histogram recap

ggplot(penguins) +
  aes(x = bill_length_mm) +
  geom_histogram()

Bar plots

Bar plots are made for categorical data. geom_bar() counts the rows in each group for you, so you only need to map one variable (x).

ggplot(penguins) +
  aes(x = species) +
  geom_bar()
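To see what "counts each group for you" means, the counting can be done by hand and the result plotted with geom_col(), which draws bars from values you supply. This is a sketch, assuming dplyr and palmerpenguins are installed; `species_counts` is a name chosen for illustration:

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)  # assumption: source of `penguins`

# One row per species, with the number of penguins in column `n`
species_counts <- count(penguins, species)

# geom_col() uses the y values as given instead of counting rows
ggplot(species_counts) +
  aes(x = species, y = n) +
  geom_col()
```

This should draw the same bars as the geom_bar() version, which is the shorter choice when the data has not been counted yet.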

Two variable plots

Scatterplot

ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm) +
  geom_point()

Scatterplot (data)

ggplot(penguins)

Scatterplot (aesthetics)

ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm)

Scatterplot (geometry)

ggplot(penguins) +
  aes(x = bill_length_mm,
      y = bill_depth_mm) +
  geom_point()

Note: Where to put aes()

Our code looks like this:

ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm) +
  geom_point()

Most ggplot code looks like this:

ggplot(penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

Either is acceptable!

What about more than two variables?

Three Variables

ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm, color = species) +
  geom_point()

Additions to Basic Plots

Histogram with a plot theme

ggplot(penguins) +
  aes(x = bill_length_mm) +
  geom_histogram() +
  theme_bw()

Histogram with options

ggplot(penguins) +
  aes(x = bill_length_mm) +
  geom_histogram(binwidth = 5)
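binwidth sets the width of each bar in data units (here, millimetres). geom_histogram() also accepts bins, which sets the number of bars instead. This is a sketch of that alternative, assuming `penguins` comes from palmerpenguins; the value 20 and the name `p_bins` are arbitrary choices for illustration:

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of `penguins`

# Ask for a number of bins rather than a bin width
p_bins <- ggplot(penguins) +
  aes(x = bill_length_mm) +
  geom_histogram(bins = 20)
p_bins
```

Trying a few values of either option is worthwhile, since the shape of a histogram can change noticeably with the binning.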

Boxplot

ggplot(penguins) +
  aes(x = species, y = bill_depth_mm) +
  geom_boxplot()

Faceting

ggplot(penguins) +
  aes(x = species, y = bill_depth_mm, color = species) +
  geom_boxplot() +
  facet_wrap(~island)

Multivariate Scatterplot by facet

ggplot(penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm) +
  geom_point() +
  facet_wrap(~species)

Some additional options

ggplot(data = penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm, color = species) +
  geom_point() +
  labs(x = "Bill Length",
       y = "Bill Depth",
       title = "Comparison of penguin bill length and bill depth across species")

Layering Geometries

geom_tile() + geom_text() = heatmap

Why is this heatmap missing boxes? Hint: look at penguin counts.

Look at the count() function and see if there’s an argument we can set to fill in the missing boxes.

library(dplyr)  # count() comes from dplyr
penguin_counts <- count(penguins, species, island)
penguin_counts
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Missing Values

ggplot(penguin_counts) +
  aes(x = species,
      y = island,
      fill = n) +
  geom_tile() +
  geom_text(aes(label = n),
            color = "white")

esquisse as a helper

Consider the esquisse package to help generate your ggplot code via drag and drop.

library(esquisse)

esquisser(penguins)

For More Practice:

R Graphics Cookbook

An excellent resource: https://r-graphics.org/